Heart diseases, if left undetected, pose a significant threat to global health and can lead to severe complications such as stroke. According to the World Health Organization, stroke is the second leading cause of death worldwide, accounting for around 11% of total deaths. In Canada, heart disease remains a pressing health issue: the Heart and Stroke Foundation reports that one person dies every five minutes from heart disease, stroke, or related conditions, which are also a primary cause of hospitalization.
In this datathon, we leverage the potential of Machine Learning and Exploratory Data Analysis for early prediction and detection of heart diseases using real-world datasets. The main objective of this datathon is to explore the efficacy of utilizing machine learning in the detection of cardiovascular events and understanding the clinical, social, and behavioral factors that might contribute to them. We will be utilizing the datasets provided for this purpose and hope to present a compelling and coherent data analysis story to our stakeholders.
Dataset #1: Mortality Dataset for Cardiovascular Disease Complications
This dataset is drawn from the medical records of 299 heart failure patients, collected at the Faisalabad Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December 2015. The patients range from 40 to 95 years old. All had left ventricular systolic dysfunction and were classified as New York Heart Association (NYHA) class III or IV, reflecting the severity of their heart conditions. The 13 features cover clinical, physiological, and lifestyle aspects, including binary indicators for anaemia, high blood pressure, diabetes, sex, and smoking. Key health metrics include creatinine phosphokinase (CPK) levels, ejection fraction percentages, serum creatinine concentrations, and sodium levels, all of which play important roles in assessing cardiac and overall health.
| Attribute | Description | Measurement | Value Range |
|---|---|---|---|
| Patient’s Age | Age of the individual | Years | [40, …, 95] |
| Anaemia Status | Low red blood cell count or low hemoglobin (anaemia) | Boolean | 0, 1 |
| Hypertension Status | High blood pressure | Boolean | 0, 1 |
| Blood Enzyme (CPK) | Level of CPK enzyme in the bloodstream | mcg/L | [23, …, 7861] |
| Diabetic Condition | Diabetes presence | Boolean | 0, 1 |
| Heart Ejection Rate | Percentage of blood ejected from the heart | Percentage | [14, …, 80] |
| Blood Platelet Count | Platelet concentration in the blood | kiloplatelets/mL | [25.01, …, 850.00] |
| Serum Creatinine Level | Concentration of creatinine in the blood | mg/dL | [0.50, …, 9.40] |
| Blood Sodium Level | Concentration of sodium in the bloodstream | mEq/L | [114, …, 148] |
| Smoking Status | Smoking habit | Boolean | 0, 1 |
| Gender | Patient’s gender (0: Female, 1: Male) | Boolean | 0, 1 |
| Follow-up Duration | Duration of follow-up period | Days | [4, …, 285] |
| Mortality Status | Mortality during follow-up | Boolean | 0, 1 |
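Units Key: mcg/L = micrograms per litre; mg/dL = milligrams per decilitre; mEq = milliequivalents; kiloplatelets/mL = thousands of platelets per millilitre.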
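The value ranges in the table above can double as a quick sanity check once the file is loaded. The sketch below is one possible way to flag out-of-range or missing values for a single patient record. The dictionary keys are hypothetical column names (the provided file may use different ones), and the record is assumed to already be expressed in the units listed in the table.

```python
# Value ranges copied from the attribute table. The keys are hypothetical
# column names, not necessarily those used in the provided file.
NUMERIC_RANGES = {
    "age": (40, 95),
    "cpk": (23, 7861),
    "ejection_fraction": (14, 80),
    "platelets": (25.01, 850.00),
    "serum_creatinine": (0.50, 9.40),
    "serum_sodium": (114, 148),
    "follow_up_days": (4, 285),
}

# Attributes the table lists as Boolean (0 or 1); names are again hypothetical.
BINARY_COLUMNS = ("anaemia", "hypertension", "diabetes", "smoking", "sex", "death")


def find_violations(record):
    """Return (column, value) pairs that are missing or outside the documented ranges."""
    violations = []
    for column, (low, high) in NUMERIC_RANGES.items():
        value = record.get(column)
        if value is None or not (low <= value <= high):
            violations.append((column, value))
    for column in BINARY_COLUMNS:
        value = record.get(column)
        if value not in (0, 1):
            violations.append((column, value))
    return violations


# Illustrative record with made-up values, not taken from the dataset.
example = {
    "age": 60, "cpk": 250, "ejection_fraction": 38, "platelets": 262.0,
    "serum_creatinine": 1.1, "serum_sodium": 137, "follow_up_days": 120,
    "anaemia": 0, "hypertension": 1, "diabetes": 0, "smoking": 0,
    "sex": 1, "death": 0,
}
print(find_violations(example))
```

A check like this only confirms that values match the documented ranges; it says nothing about whether a value is clinically plausible for a given patient.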
Dataset #2: Public Health Factors Influencing BMI
Our second dataset, known as the Cardiovascular Event Dataset, encompasses a wide range of attributes related to patients’ health and lifestyle factors. With a substantial 5,110 entries, this dataset facilitates exploration and analysis of various aspects of cardiovascular health, including gender, age, hypertension, heart diseases, smoking status, and more. Each row in this dataset contains vital patient information, making it a valuable resource for investigating the complex interplay of factors contributing to cardiovascular events.
| Attribute | Description |
|---|---|
| id | Unique identifier |
| gender | “Male”, “Female” or “Other” |
| age | Age of the patient |
| hypertension | 0 if the patient doesn’t have hypertension, 1 if they do |
| heart_disease | 0 if the patient doesn’t have any heart diseases, 1 if they do |
| ever_married | “No” or “Yes” |
| work_type | “Children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed” |
| Residence_type | “Rural” or “Urban” |
| avg_glucose_level | Average glucose level in blood |
| bmi | Body mass index |
| smoking_status | “Formerly smoked”, “Never smoked”, “Smokes” or “Unknown”* |
| stroke | 1 if the patient had a stroke, 0 if not |
*Note: “Unknown” in smoking_status indicates that the information is unavailable for this patient.
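Several attributes in this table are categorical strings, which most machine learning methods cannot consume directly. The sketch below shows one possible encoding (one-hot indicators plus 0/1 flags), assuming each row arrives as a dictionary of strings, for example from `csv.DictReader`. The helper names and output feature names are hypothetical. Category spellings are taken from the table, with the government category assumed to be spelled `Govt_job`, and are matched case-insensitively in case the file capitalizes them differently. `bmi` is parsed defensively on the assumption that some entries may not be numeric.

```python
# Category lists taken from the attribute table; matching is case-insensitive.
GENDERS = ("Male", "Female", "Other")
WORK_TYPES = ("Children", "Govt_job", "Never_worked", "Private", "Self-employed")
SMOKING = ("Formerly smoked", "Never smoked", "Smokes", "Unknown")


def to_float_or_none(text):
    """Parse a number, returning None when the entry is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category, e.g. work_govt_job."""
    key = value.strip().casefold()
    encoded = {}
    for category in categories:
        name = category.casefold().replace(" ", "_").replace("-", "_")
        encoded[f"{prefix}_{name}"] = int(key == category.casefold())
    return encoded


def encode_row(row):
    """Turn one row of strings into (numeric feature dict, stroke label)."""
    features = {
        "age": to_float_or_none(row["age"]),
        "hypertension": int(row["hypertension"]),
        "heart_disease": int(row["heart_disease"]),
        "ever_married": int(row["ever_married"].strip().casefold() == "yes"),
        "urban_residence": int(row["Residence_type"].strip().casefold() == "urban"),
        "avg_glucose_level": to_float_or_none(row["avg_glucose_level"]),
        "bmi": to_float_or_none(row["bmi"]),
    }
    features.update(one_hot(row["gender"], GENDERS, "gender"))
    features.update(one_hot(row["work_type"], WORK_TYPES, "work"))
    features.update(one_hot(row["smoking_status"], SMOKING, "smoking"))
    return features, int(row["stroke"])
```

Keeping “Unknown” as its own smoking category, as done here, is only one option; treating it as a missing value and imputing it is another, and the choice is worth stating in the report.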
The datasets can be found at Modules/Datathon #2, where they will be made available at 6:45 PM on Tuesday, September 26, 2023.
You are encouraged to discuss your work with your teammates and other teams, and you may use online and offline resources. However, all members of your team should make substantial, meaningful contributions to your submission, ensuring fairness to all participating teams in this datathon. Teams must submit the following materials by two deadlines: the low-fidelity prototype by the 8:00 PM in-class deadline, and the remaining materials by the final deadline at 2:00 PM. Teams are advised to work consistently on the deliverables from the outset rather than attempting to complete them all within the last hour. You should begin work on the deliverables at least three days before the deadline.
The first phase of this Datathon involves collaborative efforts among students, aimed at transforming the provided datasets into actionable insights. Teams should formulate research questions and outline their data analysis plans, then submit a low-fidelity prototype of their solution to Assignments/Datathon #2/Low-fidelity Prototype. Please adhere to the naming convention outlined later in this document when naming your one-page PDF submission for today.
Every team is required to submit their low-fidelity prototype through Quercus by 8:00 PM on September 26, 2023. A successful submission should include a clear and legible list of the research questions that you plan to address using the provided datasets. Additionally, provide a detailed plan specifying the analysis methods (e.g., machine learning) you intend to employ to address these questions. Ensure that each research question is paired with its corresponding analysis plan.
Please note that you are not obligated to finalize your solution or research questions at this stage. If you come up with a better idea during the week, feel free to update your plan. The primary goal of the low-fidelity milestone is to initiate the brainstorming phase of a data science project, which is typically the initial and most critical phase. It allows you to see how the project’s direction may evolve during your analysis.
All teams are expected to submit their analysis results and deliver brief presentations (2 minutes for the presentation, followed by 1 minute for questions) consisting of a minimum of 2 and a maximum of 3 slides. The purpose of these presentations is to guide your instructor and TA(s) on how you leveraged the available data to address the research question you formulated.
During your presentation, cover essential elements, including meaningful results, the data analysis process, challenges encountered, and key findings. While you have the flexibility to decide the presentation’s content, it should focus on conveying a clear understanding of the analytical process, findings, and conclusions. In essence, the presentation should provide a condensed version of the written report.
To allow the TA to prepare teams’ presentations effectively, it is imperative that teams finalize their submissions by 2:00 PM on October 3, 2023.
Teams are required to compile a report that details the steps taken to address their proposed question or prompt. While there is not a prescribed format for the report, it should encompass key sections such as:
Note: When submitting your report to Quercus, please consolidate all components into one PDF file and include links to other relevant elements within the report. Name your file following the format: Team Number-CHL5230-F23 (e.g., 25-CHL5230-F23.PDF). Submissions not adhering to this naming convention will not be considered for grading. Additionally, ensure that you include your team number and the names of all team members in your report.
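Because submissions that break the naming convention are not graded, a quick check before uploading may be worthwhile. The sketch below tests a file name against the format stated above. It assumes the team number consists of digits only and that the extension may be upper or lower case, as in the example `25-CHL5230-F23.PDF`; the function name is hypothetical.

```python
import re

# Pattern for "<team number>-CHL5230-F23.pdf". Digits-only team numbers and a
# case-insensitive extension are assumptions based on the example file name.
FILENAME_PATTERN = re.compile(r"(\d+)-CHL5230-F23\.pdf", re.IGNORECASE)


def is_valid_report_name(filename):
    """Return True when the whole file name matches the naming convention."""
    return FILENAME_PATTERN.fullmatch(filename) is not None


print(is_valid_report_name("25-CHL5230-F23.PDF"))
print(is_valid_report_name("Team25-CHL5230-F23.pdf"))
```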
At a minimum, the report should cover the question addressed, findings, the data analysis process, and a conclusion. The report must not exceed two pages in length. While the code should be functional and produce the reported results, it will not be evaluated based on code quality.
Ensure that all materials are submitted by 2:00 PM on October 3, 2023. Unfortunately, no late submissions will be accepted.
This Datathon is pretty free-form! This is intentional; projects you work on in industry will rarely be very specific. Please feel free to show early results to me to get some feedback you can use to ensure a successful submission!
| Component | Due Time | Where to Submit? |
|---|---|---|
| Data Availability | September 26, 6:45 pm | Modules/Datathon #2 |
| Low-fidelity Prototype | September 26, 8:00 pm | Assignments/Datathon #2/Low-fidelity Prototype |
| Written Report | October 3, 2:00 pm | Assignments/Datathon #2/Written Report |